Under ideal conditions and recording methods, the GPS on devices like smartphones typically has an accuracy of about ±5m, even though it will display coordinates with many more digits than that accuracy justifies. Relatively minor inaccuracies in the sample locations (5-20m) could result in many miscategorizations and miscalculations when aligning with local GIS layers, particularly highly detailed layers (like the habitat layer or land use data). I tried to characterize the amount of uncertainty, think of different ways to filter the data, and explore a couple of options for categorizing samples. Briefly:
- The GEOPRECISION field specifies whether the coordinates were measured, extrapolated, etc. We can remove bad extrapolations.
- The HABITAT field gives descriptions of where the ants were collected. These can be cross-referenced with the habitats extracted by location, under the assumption that the HABITAT field is more accurate.
- I also included some EDA for distances to the nearest road (and nearest road type), distances to the nearest building, and species accumulation curves using the habitat assigned with the 5m buffer (though I’m not convinced using the habitat layer from the structured samples is wise, at least for all categories).
GEOPRECISION
One concern with extracting local variables like habitat, land use, distance from the nearest road, or distance from the nearest building is that it requires a lot of confidence in the latitude and longitude associated with the point locations. The column GEOPRECISION indicates whether the location was extrapolated, corrected, or measured (or some combination).
| GEOPRECISION | Tubes | Percent | Percent (non-NA) |
|---|---|---|---|
| mesuré | 6159 | 89.5% | 89.6% |
| extrapolé | 624 | 9.1% | 9.1% |
| extrapolé/corrigé | 44 | 0.6% | 0.6% |
| extrapolé mauvais | 17 | 0.2% | 0.2% |
| NA | 12 | 0.2% | - |
| mesuré/corrigé | 11 | 0.2% | 0.2% |
| extrapolé (base tube précédent) | 6 | 0.1% | 0.1% |
| extrapolé (église par défaut) | 5 | 0.1% | 0.1% |
| extrapolé (gare par défaut) | 4 | 0.1% | 0.1% |
| extrapolé/corrigé (église par défaut) | 1 | 0.0% | 0.0% |
The coordinates were mostly measured directly by the collector, and only a small proportion were extrapolated badly. In theory, we could assume that mesuré, extrapolé, extrapolé/corrigé, and mesuré/corrigé indicate that the coordinates can be used directly.
The number of reported digits is an estimate of precision for coordinates reported in decimal degrees, but not for the Swiss coordinate system, which reports 6 digits no matter what. For latitude, and for longitude at the equator, an arc-degree corresponds to about 111km. At the latitude of Vaud (about 46.5ºN), an arc-degree of longitude is about 76.5km.
| Decimals | Precision (Lat.) | Precision (Lon.) |
|---|---|---|
| 1 | ± 5550 m | ± 3825 m |
| 2 | ± 555 m | ± 383 m |
| 3 | ± 55.5 m | ± 38.3 m |
| 4 | ± 5.55 m | ± 3.83 m |
| 5 | ± 0.555 m | ± 0.383 m |
| 6 | ± 0.0555 m | ± 0.0383 m |
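The values in this table can be reproduced with a few lines of base R. This is only a sketch: it assumes a flat 111km per arc-degree of latitude and the 76.5km per arc-degree of longitude quoted above, and `coord_precision()` is a hypothetical helper name, not part of the original analysis.

```r
# Half-width (in metres) of the interval implied by rounding a coordinate
# to a given number of decimal places.
# Assumed constants: 111 km per degree of latitude and 76.5 km per degree
# of longitude (the values used in the text).
coord_precision <- function(decimals,
                            km_per_deg_lat = 111,
                            km_per_deg_lon = 76.5) {
  half_step <- 0.5 * 10^(-decimals)  # maximum rounding error, in degrees
  data.frame(
    decimals = decimals,
    lat_m = half_step * km_per_deg_lat * 1000,
    lon_m = half_step * km_per_deg_lon * 1000
  )
}

precision_tab <- coord_precision(1:6)
print(precision_tab)
```

Changing `km_per_deg_lon` shows how sensitive the longitude column is to the latitude assumed for the study area.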
The reported digits can be used to set a minimum bound on the uncertainty if, e.g., only 2 digits are reported, but devices will typically report many digits even if they are not justified. There were 3945 tubes (57.3%) reporting the coordinates in decimal degrees, with the rest using the 6-digit Swiss coordinates and no estimate of precision. The decimal degree coordinates include 686 tubes with coordinates extrapolated based on the reported locality. The reliability of the extrapolated coordinates for extracting local variables like habitat or land use type depends on a clear description of the habitat by the collector.
| Decimals | Tubes | Percent |
|---|---|---|
| 1 | 1 | 0.0% |
| 2 | 52 | 1.3% |
| 3 | 165 | 4.2% |
| 4 | 670 | 17.0% |
| 5 | 669 | 17.0% |
| 6 | 1226 | 31.1% |
| 7 | 229 | 5.8% |
| 8 | 933 | 23.7% |
Typically, smartphones are accurate to a radius of about 5m under good conditions, with worse performance around buildings, bridges, trees, etc. It therefore seems likely that coordinates with >5 decimal places are overstating precision. More importantly, the 5.5% of locations with fewer than 4 decimal places should not be taken as-is with a high degree of confidence. Again, this metric isn’t available for the locations recorded with the Swiss coordinate system (2938 tubes: 43%), but it seems reasonable to assume that the distribution of precision would be roughly similar.
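As a minimal sketch of how the decimal counts could be obtained, the helper below counts the digits after the decimal point. It assumes the coordinates are stored as text; `count_decimals()` and the example values are hypothetical.

```r
# Count the digits after the decimal point in coordinates stored as text.
# Caveat: if coordinates are stored as numbers, trailing zeros are dropped,
# so the count is only a lower bound on what the collector reported.
count_decimals <- function(x) {
  x <- as.character(x)
  has_dec <- grepl(".", x, fixed = TRUE)
  n <- nchar(sub("^[^.]*\\.", "", x))  # characters after the first "."
  n[!has_dec] <- 0L                    # no decimal point: zero decimals
  n[is.na(x)] <- NA_integer_           # keep missing coordinates missing
  n
}

# Made-up example coordinates
lat <- c("46.5", "46.5197", "46.519653", "47", NA)
n_dec <- count_decimals(lat)
print(n_dec)
print(table(n_dec, useNA = "ifany"))
```

Tabulating the result per tube gives a table of the same form as the one above.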
For extracting local conditions based on point locations, it seems reasonable to buffer all points by 5-10m, with the local habitat or land use type assigned as the dominant category within the buffer. The buffer should not affect distance to the nearest road, aside from reducing most distances by a uniform amount and reducing points with distances less than the buffer radius to 0m.
It is also a good idea to remove tubes with GEOPRECISION == "extrapolé mauvais" and possibly "extrapolé (base tube précédent)", "extrapolé (église par défaut)", "extrapolé (gare par défaut)", "extrapolé/corrigé (église par défaut)", as the uncertainty seems likely to be greater than 5-10m. Lastly, tubes with fewer than 3 decimals for the lat/lon coordinates should also be removed for the same reasons. This threshold could arguably be even stricter.
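```r
library(dplyr)
library(sf)

# GEOPRECISION values whose uncertainty likely exceeds the 5-10m buffers
geo_exclude <- c("extrapolé mauvais",
                 "extrapolé (base tube précédent)",
                 "extrapolé (église par défaut)",
                 "extrapolé (gare par défaut)",
                 "extrapolé/corrigé (église par défaut)")
dec_thresh <- 3  # minimum decimals to keep: removes lat/lon with 0-2 decimals

# nchar() counts integer digits + "." + decimals: 2 integer digits for
# latitude (e.g. 46.xxx) and 1 for longitude (e.g. 6.xxx) in Vaud
pub_filt <- ant$pub %>%
  filter(!is.na(GEOPRECISION)) %>%
  filter(!GEOPRECISION %in% geo_exclude) %>%
  filter(is.na(LATITUDE) | nchar(LATITUDE) >= (dec_thresh + 3)) %>%
  filter(is.na(LONGITUDE) | nchar(LONGITUDE) >= (dec_thresh + 2))

# Buffer distances are in metres only if the layer uses a projected, metric CRS
pub.5m <- pub_filt %>% st_buffer(dist = 5)
pub.10m <- pub_filt %>% st_buffer(dist = 10)
```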
There are three land cover / land use datasets available: the Opération Fourmis habitat layer (OpFo), the CORINE land cover dataset (CORINE), and the detailed Vaud land use dataset (VD).
CORINE and the Opération Fourmis dataset have full coverage across Vaud, while the detailed land use dataset is mostly restricted to open canopy areas in the lower elevations (panels: OpFo, CORINE, VD).
Here is a random area within Vaud showing the differences. The grid is 1km x 1km, with the public inventory tubes shown as the small black points (with 5m and 10m buffers), and building footprints from OpenStreetMap (panels: OpFo, CORINE, VD).
Zoomed in on the edge of a town (OpFo, CORINE, VD):
Zoomed in close on a couple of points (OpFo, CORINE, VD):
Using the same habitat categories as the structured samples (first column of plots above), we can assign the habitat type for each tube as either the habitat at the point location, ignoring uncertainty, or the dominant habitat within the 5m or 10m buffer. Larger buffers will obviously include more habitat categories, and samples collected along roads or edges would most likely be miscategorized, since those habitat types are unlikely to have the greatest coverage within a 5m or 10m radius. Conversely, even slight inaccuracies in the coordinates would result in miscategorization of these samples based on the point locations. Assigning a habitat to each point with any degree of confidence is not trivial.
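To make the difference between the three assignments concrete, here is a sketch on toy geometry using `sf`. Everything in it is made up for illustration: the three habitat strips, the tube coordinates, the `tube_id` column, the helper `dominant_habitat()`, and the choice of EPSG:2056 as an example metric CRS. It is not the extraction code used for the tables below.

```r
library(sf)

# Toy habitat layer: forest, a 4 m wide road strip, and open land,
# in an assumed metric CRS (EPSG:2056 is used only as an example).
rect <- function(x0, x1) {
  st_polygon(list(rbind(c(x0, 0), c(x1, 0), c(x1, 100), c(x0, 100), c(x0, 0))))
}
habitat <- st_sf(
  Categorie = c("ForetMixe", "transport", "Autre"),
  geometry = st_sfc(rect(0, 100), rect(100, 104), rect(104, 204), crs = 2056)
)

# Toy tubes: one on the road 1 m from the forest edge, one deep in the forest
tubes <- st_sf(
  tube_id = 1:2,
  geometry = st_sfc(st_point(c(101, 50)), st_point(c(50, 50)), crs = 2056)
)

# Dominant category within a buffer: intersect buffers with the habitat
# polygons and keep, per tube, the category covering the largest area.
# Tubes whose buffer overlaps no polygon are dropped.
dominant_habitat <- function(pts, hab, radius) {
  buf <- st_buffer(pts, dist = radius)
  pieces <- suppressWarnings(st_intersection(buf, hab))
  pieces$area <- as.numeric(st_area(pieces))
  by_tube <- split(st_drop_geometry(pieces), pieces$tube_id)
  best <- lapply(by_tube, function(d) d[which.max(d$area), c("tube_id", "Categorie")])
  do.call(rbind, best)
}

hab_pt <- st_join(tubes, habitat)$Categorie  # habitat at the point itself
hab_5m <- dominant_habitat(tubes, habitat, 5)
hab_10m <- dominant_habitat(tubes, habitat, 10)

print(data.frame(tube_id = tubes$tube_id, pt = hab_pt,
                 buf_5m = hab_5m$Categorie, buf_10m = hab_10m$Categorie))
```

In this toy layout the roadside tube is assigned to the road at the point and with the 5m buffer, but to the neighbouring forest with the 10m buffer, which is the kind of switch described above for narrow habitat types.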
HABITAT: Cross-reference
HABITAT entries
Many of the habitats used for the OpFo structured samples are unlikely to have direct matches that would allow for unambiguous categorization. Searches for keywords could give an idea of how well the extracted habitat matches the stated habitat for the (mostly) unambiguous keywords.
Forests are generally large habitat polygons, and most HABITAT descriptions including the word forêt should be describing tubes collected in forest habitat. Edges, borders, and clearings can be filtered out to look at a sort of ‘best case’ scenario.
## Descriptions with 'for*t': 412
| Categorie | n_pt | n_5m | n_10m | pct_pt | pct_5m | pct_10m |
|---|---|---|---|---|---|---|
| Autre | 23 | 21 | 25 | 6.9% | 6.3% | 7.5% |
| CulturePerm | 1 | 1 | 1 | 0.3% | 0.3% | 0.3% |
| ForetConifere | 59 | 59 | 68 | 17.8% | 17.8% | 20.5% |
| ForetFeuillus | 44 | 43 | 47 | 13.3% | 13.0% | 14.2% |
| ForetMixe | 32 | 32 | 115 | 9.6% | 9.6% | 34.6% |
| lisiere | 18 | 20 | 19 | 5.4% | 6.0% | 5.7% |
| pierrier | 1 | 1 | 1 | 0.3% | 0.3% | 0.3% |
| transport | 115 | 117 | 22 | 34.6% | 35.2% | 6.6% |
| zalluviale | 7 | 7 | 8 | 2.1% | 2.1% | 2.4% |
| ZoneConstruite | 30 | 30 | 26 | 9.0% | 9.0% | 7.8% |
| NA | 2 | 1 | NA | 0.6% | 0.3% | NA |
The high proportion of point locations and 5m buffers classified as transport could reflect that the ants were collected along a road in the forest, or that the coordinates were recorded after returning to the car.
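The keyword cross-reference itself is a simple pattern match. The sketch below uses made-up rows; `hab_pt` and `hab_5m` are hypothetical column names for the extracted habitats, and the exclusion terms for edges, borders, and clearings are my assumption rather than the exact list used for the table.

```r
# Made-up example rows. hab_pt and hab_5m are hypothetical column names for
# the habitat extracted at the point and within the 5 m buffer. Accented
# letters are written as \u escapes so the example does not depend on the
# file encoding.
tubes <- data.frame(
  HABITAT = c("for\u00eat de h\u00eatres", "lisi\u00e8re de for\u00eat",
              "For\u00eat, bord de chemin", "jardin potager", "foret mixte"),
  hab_pt = c("ForetFeuillus", "lisiere", "transport", "ZoneConstruite", "ForetMixe"),
  hab_5m = c("ForetFeuillus", "ForetFeuillus", "transport", "ZoneConstruite", "ForetMixe")
)

# "." matches any single character, so accented and unaccented spellings match
is_forest <- grepl("for.t", tubes$HABITAT, ignore.case = TRUE)
# Assumed exclusion terms for edges, borders, and clearings
is_edge <- grepl("lisi.re|bord|clairi.re", tubes$HABITAT, ignore.case = TRUE)
best_case <- tubes[is_forest & !is_edge, ]

cat("Descriptions with 'for.t':", sum(is_forest), "\n")
print(table(point = best_case$hab_pt))
print(table(buffer_5m = best_case$hab_5m))
```

The same pattern, with different keywords, applies to the other HABITAT searches in this section.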
The a priori expectation is that the point locations should be somewhat better for narrow habitat types like lisière. I would also expect poor performance across all methods, since inaccuracy in the point location is likely to move the point outside the habitat polygon, and buffers will include more non-target habitat types.
## Descriptions with 'lisi.re': 205
| Categorie | n_pt | n_5m | n_10m | pct_pt | pct_5m | pct_10m |
|---|---|---|---|---|---|---|
| Autre | 80 | 80 | 94 | 39.0% | 39.0% | 45.9% |
| ForetConifere | 12 | 12 | 14 | 5.9% | 5.9% | 6.8% |
| ForetFeuillus | 20 | 19 | 20 | 9.8% | 9.3% | 9.8% |
| ForetMixe | 15 | 12 | 15 | 7.3% | 5.9% | 7.3% |
| lisiere | 33 | 35 | 19 | 16.1% | 17.1% | 9.3% |
| marais | 3 | 3 | 3 | 1.5% | 1.5% | 1.5% |
| pierrier | 8 | 8 | 8 | 3.9% | 3.9% | 3.9% |
| PrairieSeche | 1 | 1 | 1 | 0.5% | 0.5% | 0.5% |
| transport | 20 | 22 | 18 | 9.8% | 10.7% | 8.8% |
| ZoneConstruite | 13 | 13 | 12 | 6.3% | 6.3% | 5.9% |
| zalluviale | NA | NA | 1 | NA | NA | 0.5% |
The point locations and 5m buffer capture lisière about equally, but it is still only 17% of the tubes with lisi.re in the HABITAT description.
Like for lisière, the a priori expectation is that the point locations should be better for transport, but with relatively poor performance across all methods. There are many descriptions in HABITAT that use the word chemin, but that’s probably used more often for trails rather than actual roads.
## Descriptions with 'rue' or 'route': 105
| Categorie | n_pt | n_5m | n_10m | pct_pt | pct_5m | pct_10m |
|---|---|---|---|---|---|---|
| Autre | 33 | 33 | 37 | 31.4% | 31.4% | 35.2% |
| CulturePerm | 1 | 1 | 2 | 1.0% | 1.0% | 1.9% |
| ForetConifere | 2 | 2 | 2 | 1.9% | 1.9% | 1.9% |
| ForetMixe | 2 | 2 | 2 | 1.9% | 1.9% | 1.9% |
| lisiere | 5 | 5 | 4 | 4.8% | 4.8% | 3.8% |
| transport | 37 | 38 | 34 | 35.2% | 36.2% | 32.4% |
| ZoneConstruite | 25 | 24 | 23 | 23.8% | 22.9% | 21.9% |
| PrairieSeche | NA | NA | 1 | NA | NA | 1.0% |
More tubes with HABITAT descriptions including rue or route are classified as transport than as any other category based on the 5m buffers, but it is still only about a third. Surprisingly, there is not much difference between the point locations and the buffers.
The ZoneConstruite category should also be unambiguous.
## ZC keywords: maison|appartement|étage|balcon|cuisine
## Descriptions with keywords: 123
| Categorie | n_pt | n_5m | n_10m | pct_pt | pct_5m | pct_10m |
|---|---|---|---|---|---|---|
| Autre | 4 | 4 | 4 | 3.3% | 3.3% | 3.3% |
| CulturePerm | 3 | 3 | 3 | 2.4% | 2.4% | 2.4% |
| ForetFeuillus | 1 | 1 | 1 | 0.8% | 0.8% | 0.8% |
| ForetMixe | 2 | 2 | 2 | 1.6% | 1.6% | 1.6% |
| lisiere | 4 | 4 | 3 | 3.3% | 3.3% | 2.4% |
| transport | 8 | 9 | 7 | 6.5% | 7.3% | 5.7% |
| ZoneConstruite | 101 | 100 | 103 | 82.1% | 81.3% | 83.7% |
Generally good correspondence, with minimal differences among buffers.
Samples collected in pastures should be classified pretty reliably as Autre or PrairieSeche. The table and figure exclude HABITAT descriptions that contain lisi*re.
## Descriptions with p*turage: 187
| Categorie | n_pt | n_5m | n_10m | pct_pt | pct_5m | pct_10m |
|---|---|---|---|---|---|---|
| Autre | 97 | 98 | 109 | 61.8% | 62.4% | 69.4% |
| ForetConifere | 2 | 2 | 2 | 1.3% | 1.3% | 1.3% |
| lisiere | 23 | 23 | 22 | 14.6% | 14.6% | 14.0% |
| PrairieSeche | 12 | 12 | 12 | 7.6% | 7.6% | 7.6% |
| transport | 18 | 17 | 6 | 11.5% | 10.8% | 3.8% |
| ZoneConstruite | 5 | 5 | 6 | 3.2% | 3.2% | 3.8% |
Tubes collected in pastures based on HABITAT description are distributed across only 6 types of extracted land cover, with most aligning with Autre. Sizable numbers mapped to lisière or transport, despite removing descriptions containing the word lisi*re.
HABITAT & OpFo habitats
Some of the HABITAT descriptions specify that the tubes were collected in gardens. My expectation is that these tubes should be almost entirely categorized as ZoneConstruite, Autre, and CulturePerm.
## Descriptions with 'jardin' or 'potag*': 257
| Categorie | n_pt | n_5m | n_10m | pct_pt | pct_5m | pct_10m |
|---|---|---|---|---|---|---|
| Autre | 23 | 23 | 23 | 8.9% | 8.9% | 8.9% |
| CulturePerm | 12 | 12 | 12 | 4.7% | 4.7% | 4.7% |
| ForetConifere | 5 | 5 | 7 | 1.9% | 1.9% | 2.7% |
| ForetFeuillus | 3 | 3 | 3 | 1.2% | 1.2% | 1.2% |
| ForetMixe | 2 | 2 | 2 | 0.8% | 0.8% | 0.8% |
| lisiere | 7 | 7 | 3 | 2.7% | 2.7% | 1.2% |
| transport | 28 | 28 | 16 | 10.9% | 10.9% | 6.2% |
| ZoneConstruite | 177 | 177 | 191 | 68.9% | 68.9% | 74.3% |
Based on the table, the point locations and 5m buffers each place about 83% of the tubes in ZoneConstruite, Autre, or CulturePerm, rising to about 88% with the 10m buffers, mostly at the expense of transport.
As a first approximation, we could classify points as inside or outside urban areas using the CORINE land cover categories (1XX indicate human-dominated types). To categorize tubes as coming from gardens, there are two options: 1) use the HABITAT descriptions as above, including all tubes with jardin or potag* in the description, or 2) use the 909 Jardin potager category in the Vaud land use dataset. Unfortunately, there are no tubes that are categorized as 909 Jardin potager based on location, regardless of buffering.
There are too few tubes with CORINE classifications of 111 Continuous urban fabric, which identifies parts of the few largest cities in Vaud. The 112 Discontinuous urban fabric identifies most (but not all) towns, so the comparison of gardens between cities vs. non-cities would need to be between urban and non-urban categories.
| Garden | Non-Urban | Urban |
|---|---|---|
| FALSE | 3961 | 2568 |
| TRUE | 66 | 191 |
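As a quick sketch of how this cross-tabulation could be used, the counts can be entered directly and summarized. The CORINE codes in the first lines are made-up examples of the 1XX rule, and the chi-squared test is only one possible check of association, not something done in the original analysis.

```r
# Urban rule from the text: CORINE codes 1XX are human-dominated types.
# The codes below are made-up examples.
corine_code <- c(111, 112, 211, 312, 321)
is_urban <- corine_code %/% 100 == 1
print(is_urban)

# Counts copied from the table above
# (rows: garden FALSE/TRUE; columns: non-urban/urban)
garden_urban <- matrix(c(3961, 66, 2568, 191), nrow = 2,
                       dimnames = list(Garden = c("FALSE", "TRUE"),
                                       Zone = c("Non-Urban", "Urban")))

# Share of tubes in each zone, separately for non-garden and garden tubes
row_share <- prop.table(garden_urban, margin = 1)
print(round(row_share, 3))

# One possible test of association between garden and zone
print(chisq.test(garden_urban))
```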
Another possibility would be to categorize communes based on population (that’s the smallest unit I’ve found).
For the distribution of population sizes among communes, there isn’t much of a clear breakpoint aside from Lausanne.
Using the habitat types from the structured samples, the public dataset clearly overrepresents ZoneConstruite and underrepresents Autre.
Similarly with the CORINE dataset, category 112 Discontinuous urban fabric is very overrepresented, with clear underrepresentation for 211 Non-irrigated arable land and 312 Coniferous forest.
For the land use, many crops are underrepresented, while pastures tend to be overrepresented. This is not really surprising given where people would be expected to go to collect ants.
For each tube, we can also use the location to calculate the distance to the nearest road and/or building, and potentially what type of road it is. This could be interesting for roads, since the dataset from OpenStreetMap distinguishes everything from paths to highways.
Here are maps for each type of road, restricted to the 15 most extensive categories (total length ≥ 97km).
Buffering points should have minimal influence on distance to the nearest road, since the distance would be reduced by the buffer radius uniformly. The exception would be points nearer to a road than the buffer radius, which would all have a distance of 0m. The type of road nearest to the coordinates should be similarly (mostly) unaffected, though it is possible that a buffer could intersect multiple types of roads. For now, let’s ignore that and just use the point locations.
Many samples were collected near paths, residential roads, and service roads.
## Summary of distances to the nearest road (m):
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 5.173 13.663 27.592 28.864 520.686
## Within 10m: 2766 tubes = 41 %
## Within 50m: 5852 tubes = 86 %
## Further than 200m: 79 tubes = 1 %
## Further than 400m: 11 tubes = 0 %
As should be expected, most points are quite close to a road or path, with 41% of samples within 10m. A small number are quite far from any trail in the dataset.
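A minimal sketch of the nearest-road calculation with `sf`, on toy data. The road layer, the `highway` column name, the tube coordinates, and the CRS (EPSG:2056, chosen only as an example of a metric projection) are all assumptions for illustration.

```r
library(sf)

# Toy roads and tubes in an assumed metric CRS; road types are made up
roads <- st_sf(
  highway = c("path", "residential"),
  geometry = st_sfc(st_linestring(rbind(c(0, 0), c(100, 0))),
                    st_linestring(rbind(c(0, 50), c(100, 50))),
                    crs = 2056)
)
tubes <- st_sf(
  tube_id = 1:3,
  geometry = st_sfc(st_point(c(10, 3)), st_point(c(50, 40)),
                    st_point(c(80, 20)), crs = 2056)
)

# Index of the nearest road for each tube, then its type and distance
nearest <- st_nearest_feature(tubes, roads)
tubes$road_type <- roads$highway[nearest]
tubes$road_dist <- as.numeric(
  st_distance(tubes, roads[nearest, ], by_element = TRUE)
)

# Effect of a buffer of radius r: distances shrink by r, floored at 0
r <- 5
tubes$road_dist_5m <- pmax(tubes$road_dist - r, 0)

print(st_drop_geometry(tubes))
print(summary(tubes$road_dist))
```

The last two columns illustrate why buffering matters little here: every distance drops by the buffer radius, and only tubes closer than the radius collapse to 0m.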
Unlike for roads, the building dataset doesn’t include any usages or descriptions, but consists of building footprints across the whole canton.
## Summary of distances to the nearest building (m):
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 10.36 54.48 142.81 213.40 2314.13
## Within 10m: 1646 tubes = 24 %
## Within 50m: 3277 tubes = 48 %
## Further than 200m: 1790 tubes = 26 %
## Further than 400m: 732 tubes = 11 %
We could make species accumulation curves to compare across habitat types or to compare with the structured samples.
Restricted to categories with greater than 10 tubes.
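For reference, a species accumulation curve can be approximated in base R by averaging over random orderings of the tubes. This is a sketch with made-up species labels and a hypothetical helper `accumulate_species()`; it is not necessarily how the curves here were produced, and packaged alternatives such as `specaccum()` in the vegan package exist.

```r
# Mean species accumulation over random orderings of tubes.
# `species` holds one (made-up) species name per tube.
accumulate_species <- function(species, n_perm = 100) {
  n <- length(species)
  curves <- matrix(
    vapply(seq_len(n_perm), function(i) {
      shuffled <- sample(species)
      cumsum(!duplicated(shuffled))  # running count of distinct species
    }, numeric(n)),
    nrow = n
  )
  data.frame(tubes = seq_len(n), richness = rowMeans(curves))
}

set.seed(1)
toy <- sample(c("sp_A", "sp_B", "sp_C", "sp_D"), size = 30,
              replace = TRUE, prob = c(0.6, 0.25, 0.1, 0.05))
acc <- accumulate_species(toy)
print(head(acc))
```

Running the helper separately for each habitat category gives one curve per category, which can then be overlaid for comparison.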
Using the garden categorization based on HABITAT descriptions as above, and the urban / non-urban categorization based on the CORINE dataset as above, we could maybe compare gardens in and out of urban zones.
Using pastures from the HABITAT descriptions as above, restricting to Autre and Prairie Sèche based on 5m buffers.
For the public inventory dataset, we’ll use the categorizations from the 5m buffers. The geolocations of the structured inventory are highly reliable since they were carefully placed to align with the habitat layer. We can thus use the point locations directly. However, this should be updated depending on the question: currently, only the detections from the structured samples are included, as if each tube were independent and as if the plots without detections did not exist. If the goal is to compare the ants, this is fine. If it is to compare sampling effort, it should be changed to use the plots instead of the tubes.
These plots compare the composition of the ant samples in each dataset to the composition of the habitats in Vaud. Thus, they show some combination of 1) discrepancy between the composition of the sampling effort compared to the composition of Vaud, and 2) discrepancy in the ant densities across habitats compared to Vaud. For the public samples, we don’t know the weight of each component, though almost certainly 1) is more important. For the structured points, we could standardize since we know the sampling effort.
The tubes in the structured samples are qualitatively more similar to Vaud than are the public samples. The largest discrepancies seem to be that the structured samples over-represent Prairie Sèche and Autre, and under-represent Forêt Conifère and ZoneConstruite. We could test whether these differences are significant (overall \(\chi^2\) is significant).
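One way such a test could look is a chi-squared goodness-of-fit test of tube counts against the area share of each habitat. The counts and shares below are made up purely to show the mechanics; they are not the values behind the plots.

```r
# Made-up tube counts per habitat and made-up area shares of each habitat
tube_counts <- c(Autre = 120, ForetConifere = 30,
                 PrairieSeche = 40, ZoneConstruite = 10)
area_share <- c(Autre = 0.50, ForetConifere = 0.25,
                PrairieSeche = 0.05, ZoneConstruite = 0.20)

# Goodness of fit: are tubes distributed in proportion to habitat area?
gof <- chisq.test(x = tube_counts, p = area_share)
print(gof)

# Standardized residuals: positive means over-represented relative to area
print(round(gof$stdres, 1))
```

The standardized residuals indicate which categories drive an overall significant result, which is more informative than the overall test alone.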
Again, the structured samples are generally more similar to Vaud than are the public samples. The biggest discrepancies between the structured samples and Vaud are over-representation of 321: Natural grasslands and 243: Agricultural land with significant areas of natural vegetation, and under-representation of 112: Discontinuous urban fabric, 211: Non-irrigated arable land, and 312: Coniferous forest. This largely matches the OpFo categories, but with land that probably corresponds to pastures (321, 243) separated from land that probably corresponds more to crops (211). BUT I NEED TO LOOK TO SEE HOW THESE CATEGORIES ALIGN ACROSS DATASETS.
The structured samples include more samples from cropland. Samples from 611 Prairies extensive are quite over-represented. In fact, the majority of prairies and pastures seem to be over-represented (601-623). This obviously makes sense if ants are more likely to be found in pastures and prairies than in cropland.
The COLLECTORFIELDNUMBER column identifies each collector. The group field trips were entered as a single collection (e.g., COLLECTORFIELDNUMBER == "collector_formation1"). We could look at the impact of including these field trips (direct: tubes collected; indirect: proportion of participants sending in additional samples, etc.), and also the impact of including a small number of experts, comparing them to the rest of the citizen science effort. DO WE KNOW THE COLLECTOR IDs FOR THE EXPERTS?
We could categorize the COLLECTORFIELDNUMBER as group trip, expert, or public. For now, we can call anyone who collected >20 species an expert. I imagine the names are somewhere, and maybe we also call anyone in the DEE an expert?
## Experts:
## collector_0033, collector_0036, collector_0058, collector_0099, collector_0133, collector_0142, collector_0547
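The provisional rule (more than 20 species means expert) could be applied along these lines. The rows are made up, `species` is a hypothetical column name, and matching group trips on the string "formation" is an assumption based on the single example ID given above.

```r
# Made-up records: one row per identified tube
tubes <- data.frame(
  COLLECTORFIELDNUMBER = c(rep("collector_0001", 3),
                           rep("collector_0002", 25),
                           "collector_formation1"),
  species = c("sp_1", "sp_1", "sp_2", paste0("sp_", 1:25), "sp_3")
)

# Number of distinct species per collector
n_species <- tapply(tubes$species, tubes$COLLECTORFIELDNUMBER,
                    function(s) length(unique(s)))

# Group trips are assumed to be identifiable by "formation" in the ID
collector_type <- ifelse(grepl("formation", names(n_species)), "group trip",
                         ifelse(n_species > 20, "expert", "public"))
names(collector_type) <- names(n_species)

print(n_species)
print(collector_type)
```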
How did the number of collectors change across the year?
How did the number of new collectors change across the year?
The DATECOLLECTION column records the collection date reported by the collector. I’m not sure if it would be possible to look for effects of the group collecting trips or news stories. There isn’t anything obvious by day, but maybe by week… There are a couple of errors in the dates (year = 2018: excluded from plots).
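A weekly summary of active and new collectors could be computed roughly as follows. The records and dates are made up (including the choice of 2019 for the valid records), and the only filtering taken from the text is dropping dates with year 2018.

```r
# Made-up records; the last row mimics an erroneous 2018 date
tubes <- data.frame(
  COLLECTORFIELDNUMBER = c("c1", "c2", "c1", "c3", "c2", "c4"),
  DATECOLLECTION = as.Date(c("2019-04-01", "2019-04-03", "2019-04-10",
                             "2019-04-11", "2019-04-20", "2018-05-01"))
)

# Exclude the erroneous 2018 dates, as in the plots
ok <- tubes[format(tubes$DATECOLLECTION, "%Y") != "2018", ]

# Assign each record to a week (weeks start on Monday by default)
ok$week <- cut(ok$DATECOLLECTION, breaks = "week")

# A collector is "new" in the week of their first record
ok <- ok[order(ok$DATECOLLECTION), ]
ok$is_new <- !duplicated(ok$COLLECTORFIELDNUMBER)

# Weeks without any records get 0 rather than NA
active <- tapply(ok$COLLECTORFIELDNUMBER, ok$week,
                 function(x) length(unique(x)), default = 0)
new_per_week <- tapply(ok$is_new, ok$week, sum, default = 0)

print(data.frame(week = names(active),
                 active = as.vector(active),
                 new = as.vector(new_per_week)))
```

Aggregating by week rather than by day smooths out weekday effects, which may make any response to group trips or news stories easier to see.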